Introduction to Open Data Science - Course Project

About the project

Write a short description about the course and add a link to your GitHub repository here. This is an R Markdown (.Rmd) file so you should use R Markdown syntax.

I look forward to learning a lot about R and data science.

My project page: https://coolnesss.github.io/IODS-project/

# This is a so-called "R chunk" where you can write R code.

date()
## [1] "Fri Dec  3 15:24:19 2021"

The text continues here.


Chapter 2: Student data

Describe the work you have done this week and summarize your learning.

date()
## [1] "Fri Dec  3 15:24:19 2021"

Let's start by reading the data.

analysis = read.csv('data/data.csv', header = TRUE)
str(analysis)
## 'data.frame':    166 obs. of  7 variables:
##  $ Age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ Attitude: int  37 31 25 35 37 38 35 29 38 21 ...
##  $ Points  : int  25 12 24 10 22 21 21 31 24 26 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...

The data contains student questionnaire answers together with background attributes such as gender and age. The remaining variables describe attitude towards the subject (Attitude), strategic, deep and surface learning approaches (stra, deep, surf), and the exam score (Points).

library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(ggplot2)

ggpairs(analysis, mapping = aes(), lower = list(combo = wrap("facethist", bins = 20)))

The pairwise plot indicates that, for example, gender has little effect on Points, and the corresponding correlation is also small. On the other hand, there is a large correlation between Points and Attitude, which suggests that students with a more positive attitude towards the subject got more points in the exam. The age distribution is highly right-skewed, with most respondents close to the age of 20. Otherwise, most of the columns roughly follow a Gaussian distribution.

Let's choose the following three attributes to fit our regression model on: Attitude, Age and gender. In other words, we are interested in whether we can explain the exam points using these three variables.

linear_regression = lm(Points ~ Age+Attitude+gender, data = analysis)
summary(linear_regression)
## 
## Call:
## lm(formula = Points ~ Age + Attitude + gender, data = analysis)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.4590  -3.3221   0.2186   4.0247  10.4632 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.42910    2.29043   5.863 2.48e-08 ***
## Age         -0.07586    0.05367  -1.414    0.159    
## Attitude     0.36066    0.05932   6.080 8.34e-09 ***
## genderM     -0.33054    0.91934  -0.360    0.720    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.315 on 162 degrees of freedom
## Multiple R-squared:  0.2018, Adjusted R-squared:  0.187 
## F-statistic: 13.65 on 3 and 162 DF,  p-value: 5.536e-08

It would seem that Attitude indeed has a sizable positive coefficient, meaning it contributes strongly to the prediction, and the very small p-value of its significance test means this result is unlikely to be due to chance. Gender, on the other hand, has a negative coefficient, meaning that under the model being male slightly reduces the points received, but its high p-value shows the effect is not statistically significant. Finally, Age has both a fairly large p-value and a negligible coefficient.

Seeing that Age and gender are not reliable predictors of exam scores, let's remove them from the model and try again with Attitude as the only predictor.

linear_regression = lm(Points ~ Attitude, data = analysis)
summary(linear_regression)
## 
## Call:
## lm(formula = Points ~ Attitude, data = analysis)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.63715    1.83035   6.358 1.95e-09 ***
## Attitude     0.35255    0.05674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

Now we can see that Attitude is indeed a strong predictor of exam points, and the result is statistically significant, i.e. unlikely to be attributed to chance.

Finally, let's interpret the multiple R-squared statistic of the model. An R-squared of 0.19 means that the model explains about 19% of the variance in exam points. Importantly, after taking out Age and gender, the R-squared dropped by only about 0.01, meaning those two variables contributed very little to explaining the data.
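As a sanity check on this interpretation, R-squared can be computed by hand as 1 - SS_res / SS_tot, the share of the variance in the response explained by the model. A minimal sketch on made-up numbers (not the course data):

```r
# Toy data: y roughly linear in x, plus a little noise
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8, 12.3)

fit <- lm(y ~ x)

# R-squared = 1 - SS_res / SS_tot: the share of variance in y
# that the model explains
ss_res <- sum(residuals(fit)^2)
ss_tot <- sum((y - mean(y))^2)
r2_manual <- 1 - ss_res / ss_tot

# Matches the value reported by summary()
all.equal(r2_manual, summary(fit)$r.squared)
```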

We also need to visualize the model using plots.

par(mfrow=c(2,2))
plot(linear_regression, which = c(1,2,5))

The first plot, residuals vs fitted values, shows that apart from the extreme residuals around fitted values of 28, the residuals are spread evenly around zero, so there does not appear to be an underlying non-linearity violating the linearity assumption of our model. The normal Q-Q plot shows that the residuals deviate slightly from the straight line at the tails; perfectly normal residuals would fall on a straight line, so this suggests the errors are not exactly normally distributed. Finally, the residuals vs leverage plot shows whether any single observation has an outsized influence on the fitted model, i.e. whether removing one data point would change the coefficients substantially. In this case there are no such points: all observations stay well inside the Cook's distance contours, so the dashed red lines are not even visible.
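The quantities behind the leverage plot can also be inspected numerically: hatvalues() gives the leverage of each observation and cooks.distance() its influence. A toy sketch (illustrative data with one deliberately extreme x value, not the course data):

```r
x <- c(1, 2, 3, 4, 5, 20)   # the last point lies far from the rest
y <- c(2.2, 3.9, 6.1, 8.0, 9.9, 41.0)
fit <- lm(y ~ x)

# Leverage depends only on how unusual x is; Cook's distance combines
# leverage with the size of the residual
round(hatvalues(fit), 3)       # the sixth point has by far the largest leverage
round(cooks.distance(fit), 3)
```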

Chapter 3: Alcohol data

date()
## [1] "Fri Dec  3 15:24:23 2021"
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ tibble  3.1.6     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.4     ✓ stringr 1.4.0
## ✓ readr   2.1.1     ✓ forcats 0.5.1
## ✓ purrr   0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
data = read.csv('data/alc_data.csv')
str(data)
## 'data.frame':    370 obs. of  35 variables:
##  $ school    : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sex       : Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
##  $ age       : int  15 15 15 15 15 15 15 15 15 15 ...
##  $ address   : Factor w/ 2 levels "R","U": 1 1 1 1 1 1 1 1 1 2 ...
##  $ famsize   : Factor w/ 2 levels "GT3","LE3": 1 1 1 1 1 1 1 2 2 1 ...
##  $ Pstatus   : Factor w/ 2 levels "A","T": 2 2 2 2 2 2 2 2 2 1 ...
##  $ Medu      : int  1 1 2 2 3 3 3 2 3 3 ...
##  $ Fedu      : int  1 1 2 4 3 4 4 2 1 3 ...
##  $ Mjob      : Factor w/ 5 levels "at_home","health",..: 1 3 1 4 4 4 4 2 3 3 ...
##  $ Fjob      : Factor w/ 5 levels "at_home","health",..: 3 3 3 2 4 2 5 4 3 2 ...
##  $ reason    : Factor w/ 4 levels "course","home",..: 2 4 4 1 4 1 1 4 4 4 ...
##  $ guardian  : Factor w/ 3 levels "father","mother",..: 2 2 2 2 3 2 1 2 1 1 ...
##  $ traveltime: int  2 1 1 1 2 1 2 2 2 1 ...
##  $ studytime : int  4 2 1 3 3 3 3 2 4 4 ...
##  $ schoolsup : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 1 2 1 2 ...
##  $ famsup    : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 1 ...
##  $ activities: Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 1 1 1 1 ...
##  $ nursery   : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 1 2 ...
##  $ higher    : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ internet  : Factor w/ 2 levels "no","yes": 2 2 1 2 2 2 2 2 2 1 ...
##  $ romantic  : Factor w/ 2 levels "no","yes": 1 2 1 1 2 1 2 1 1 1 ...
##  $ famrel    : int  3 3 4 4 4 4 4 4 4 4 ...
##  $ freetime  : int  1 3 3 3 2 3 2 1 4 3 ...
##  $ goout     : int  2 4 1 2 1 2 2 3 2 3 ...
##  $ Dalc      : int  1 2 1 1 2 1 2 1 2 1 ...
##  $ Walc      : int  1 4 1 1 3 1 2 3 3 1 ...
##  $ health    : int  1 5 2 5 3 5 5 4 3 4 ...
##  $ alc_use   : num  1 3 1 1 2.5 1 2 2 2.5 1 ...
##  $ high_use  : logi  FALSE TRUE FALSE FALSE TRUE FALSE ...
##  $ failures  : int  0 1 0 0 1 0 1 0 0 0 ...
##  $ paid      : Factor w/ 2 levels "no","yes": 2 1 1 1 2 1 1 1 1 1 ...
##  $ absences  : int  3 2 8 2 5 2 0 1 9 10 ...
##  $ G1        : int  10 10 14 10 12 12 11 10 16 10 ...
##  $ G2        : int  12 8 13 10 12 12 6 10 16 10 ...
##  $ G3        : int  12 8 12 9 12 12 6 10 16 10 ...
print(max(data$age))
## [1] 22
print(min(data$age))
## [1] 15

The data consists of a questionnaire administered to students aged 15-22. It asks various questions about their background, what they do in their free time, how much alcohol they consume, and so forth. Their grades are also included.

We will choose four variables to predict alcohol use. I will choose failures, since failing a class might indicate alcohol use. In addition, freetime will be chosen, since people who have a lot of free time might drink more. Age will also be included. Finally, I will choose studytime, since people who study more don't have time to drink.

library(GGally)
library(ggplot2)
data$freetime = as.ordered(data$freetime)
data$studytime = as.ordered(data$studytime)
data$failures = as.ordered(data$failures)

data_subset = data[c('alc_use', 'freetime', 'studytime', 'age', 'failures')]

ggpairs(data_subset, mapping = aes(), lower = list(combo = wrap("facethist", bins = 20)))

Here we have the relationships between the chosen variables and alcohol use. From the top row we can see that in particular the number of failures correlates with alcohol use: students with three failures had the highest alcohol use. Age also seems to correlate slightly with alcohol use, with older students drinking more. Students who study the most have the lowest alcohol use. For freetime, the connection is less clear. Overall, the chosen variables seem reasonably good at describing alcohol use.

glm_data = data[c('high_use', 'freetime', 'studytime', 'age', 'failures')]
glm_data$failures = ordered(glm_data$failures)
glm_data$age = as.numeric(glm_data$age)
glm_data$studytime = as.ordered(glm_data$studytime)
glm_data$freetime = as.ordered(glm_data$freetime)
glm_model = glm(high_use ~ ., data=glm_data, family = "binomial")
glm_model
## 
## Call:  glm(formula = high_use ~ ., family = "binomial", data = glm_data)
## 
## Coefficients:
## (Intercept)   freetime.L   freetime.Q   freetime.C   freetime^4  studytime.L  
##    -3.61634      1.13774     -0.37684     -0.01943     -0.16128     -0.98268  
## studytime.Q  studytime.C          age   failures.L   failures.Q   failures.C  
##     0.37394      0.50550      0.17260      0.79327     -0.17619      0.34081  
## 
## Degrees of Freedom: 369 Total (i.e. Null);  358 Residual
## Null Deviance:       452 
## Residual Deviance: 413.3     AIC: 437.3

Here we cast all the features except age into ordered factors, since their values have a clear ordering (freetime 5 is more than freetime 4).
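The .L, .Q and .C terms in the model output above come from R's default polynomial contrasts (contr.poly) for ordered factors. A minimal illustration on a made-up factor:

```r
# An ordered factor with 4 levels gets linear, quadratic and cubic
# contrast columns by default (contr.poly)
f <- ordered(c(1, 2, 3, 4, 2, 3), levels = 1:4)
contrasts(f)                 # columns .L, .Q, .C

# In a model matrix these become the f.L, f.Q, f.C terms
colnames(model.matrix(~ f))
```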

print(exp(coef(glm_model)))
## (Intercept)  freetime.L  freetime.Q  freetime.C  freetime^4 studytime.L 
##  0.02688092  3.11970354  0.68602398  0.98075575  0.85105503  0.37430663 
## studytime.Q studytime.C         age  failures.L  failures.Q  failures.C 
##  1.45344515  1.65781273  1.18838494  2.21061040  0.83845977  1.40609193
print(confint(glm_model))
## Waiting for profiling to be done...
##                   2.5 %      97.5 %
## (Intercept) -7.31978014  0.02252094
## freetime.L   0.15826962  2.40079162
## freetime.Q  -1.45950625  0.47225107
## freetime.C  -0.71317341  0.74441514
## freetime^4  -0.65580945  0.33331230
## studytime.L -1.87775566 -0.23173544
## studytime.Q -0.38618233  1.07809059
## studytime.C -0.08866794  1.15363489
## age         -0.03309116  0.38217471
## failures.L  -0.72327871  2.87497442
## failures.Q  -1.44760231  1.45410711
## failures.C  -0.67903331  1.40836278

Since we specified some of the features as ordered, R has fit each of them with a series of orthogonal polynomial contrasts (L for linear, Q for quadratic and so on). Let's focus on the linear terms. The exponentials of the coefficients can be interpreted as odds ratios. The linear term for failures, for example, has an odds ratio of about 2.2, meaning the odds of high alcohol use roughly double per step up in the number of failed classes, though the confidence interval for this term is wide and includes zero on the log-odds scale. Freetime also seems to have a substantial association with high alcohol use, with an odds ratio of about 3.1 for its linear term. Perhaps somewhat surprisingly, age doesn't play a big role, and its confidence interval also includes zero.
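As a toy check of the odds-ratio interpretation (made-up data, not the course data): for a single binary predictor, the exponentiated logistic-regression coefficient equals the ratio of the empirical odds in the two groups.

```r
set.seed(1)
# Made-up binary outcome y and binary group indicator g
g <- rep(c(0, 1), each = 100)
p <- ifelse(g == 1, 0.6, 0.3)   # true probability of y = 1 in each group
y <- rbinom(200, 1, p)

fit <- glm(y ~ g, family = "binomial")

# Odds ratio from the model ...
or_model <- exp(coef(fit)["g"])

# ... equals the ratio of the empirical odds
odds <- function(v) mean(v) / (1 - mean(v))
or_empirical <- odds(y[g == 1]) / odds(y[g == 0])

all.equal(unname(or_model), or_empirical)
```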

alc = glm_data
probabilities <- predict(glm_model, type = "response")

# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)

# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = alc$probability > 0.5)

# see the last ten original classes, predicted probabilities, and class predictions
select(alc, failures, age, freetime, studytime, high_use, probability, prediction) %>% tail(10)
##     failures age freetime studytime high_use probability prediction
## 361        1  18        4         1     TRUE   0.7135619       TRUE
## 362        0  18        3         1     TRUE   0.4044013      FALSE
## 363        0  18        4         1     TRUE   0.5192315       TRUE
## 364        0  18        2         2     TRUE   0.2633834      FALSE
## 365        0  18        3         1    FALSE   0.4044013      FALSE
## 366        0  18        3         1     TRUE   0.4044013      FALSE
## 367        0  18        2         3    FALSE   0.1046941      FALSE
## 368        1  19        4         1     TRUE   0.7475035       TRUE
## 369        2  19        4         2     TRUE   0.6505777       TRUE
## 370        1  19        3         1    FALSE   0.6504956       TRUE
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   244   15
##    TRUE     93   18

We can see from the matrix that the model predicted TRUE 18 times when the student actually was a high user, but also 15 times when they were not. On the other hand, it correctly predicted FALSE for 244 of the low users, yet also predicted FALSE for 93 of the actual high users. In other words, the model catches only a small fraction of the true high users. The training accuracy can be calculated as follows.

c_mat = table(high_use = alc$high_use, prediction = alc$prediction)
print('accuracy')
## [1] "accuracy"
print(1 - (c_mat[1, 2] + c_mat[2, 1]) / sum(c_mat))
## [1] 0.7081081
print('majority class')
## [1] "majority class"
print(mean(glm_data$high_use))
## [1] 0.3

Always guessing FALSE would already give 0.7 accuracy (since 30% of the students are high users), so at 0.708 our model is barely better than that. I guess the model is not very useful after all.
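Accuracy alone hides the class imbalance; sensitivity and specificity make the weakness explicit. A small sketch using the counts from the cross-tabulation above:

```r
# Counts from the cross-tabulation above:
#            predicted
# high_use   FALSE TRUE
#   FALSE      244   15
#   TRUE        93   18
tn <- 244; fp <- 15; fn <- 93; tp <- 18

accuracy    <- (tp + tn) / (tp + tn + fp + fn)
sensitivity <- tp / (tp + fn)   # true positive rate
specificity <- tn / (tn + fp)   # true negative rate

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity), 3)
# Sensitivity is only about 0.16: the model misses most high users
```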

Boston housing clustering

date()
## [1] "Fri Dec  3 15:24:26 2021"
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
data = Boston
str(data)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

The Boston housing dataset describes characteristics of suburbs in the Boston area. There are crime rates, tax rates, distances to employment centres, ages of buildings and even the proportion of Black residents in each district. In this chapter the main task is to classify the districts by their crime rate and to cluster the data.

library(GGally)
library(ggplot2)

data$chas = as.factor(data$chas)
ggpairs(data, mapping = aes(colour=chas), lower = list(combo = wrap("facethist", bins = 20)))

Here chas is a binary variable, so I used it to colour the plot. The variables nox and age seem to be highly correlated, which suggests that areas with older buildings have higher nitrogen oxide levels. In addition, distance to employment centres is negatively correlated with the nitrogen oxide levels, so areas further from the centres have fewer problems in this regard. Also, curiously, the tax rate seems to be positively correlated with crime in the area.

library(dplyr)
colnames(data)
##  [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"    
##  [8] "dis"     "rad"     "tax"     "ptratio" "black"   "lstat"   "medv"
boston_scaled <- data %>% mutate_at(c("crim", "zn", "indus", "nox", "rm", "age", "dis", "rad", "tax", "ptratio", "black", "lstat", "medv"), ~(scale(.) %>% as.vector))
str(boston_scaled)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  -0.419 -0.417 -0.417 -0.416 -0.412 ...
##  $ zn     : num  0.285 -0.487 -0.487 -0.487 -0.487 ...
##  $ indus  : num  -1.287 -0.593 -0.593 -1.306 -1.306 ...
##  $ chas   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ nox    : num  -0.144 -0.74 -0.74 -0.834 -0.834 ...
##  $ rm     : num  0.413 0.194 1.281 1.015 1.227 ...
##  $ age    : num  -0.12 0.367 -0.266 -0.809 -0.511 ...
##  $ dis    : num  0.14 0.557 0.557 1.077 1.077 ...
##  $ rad    : num  -0.982 -0.867 -0.867 -0.752 -0.752 ...
##  $ tax    : num  -0.666 -0.986 -0.986 -1.105 -1.105 ...
##  $ ptratio: num  -1.458 -0.303 -0.303 0.113 0.113 ...
##  $ black  : num  0.441 0.441 0.396 0.416 0.441 ...
##  $ lstat  : num  -1.074 -0.492 -1.208 -1.36 -1.025 ...
##  $ medv   : num  0.16 -0.101 1.323 1.182 1.486 ...

Here we did not scale the variable chas, since it's binary. Scaling variables that are already proportions also seems odd, but I did it anyway since the instructions asked for it. Now let's create the factor variable.
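As an aside, scale() standardizes each column to mean 0 and standard deviation 1, i.e. (x - mean(x)) / sd(x); a quick check on made-up numbers:

```r
x <- c(10, 20, 30, 40, 50)

scaled_builtin <- as.vector(scale(x))
scaled_manual  <- (x - mean(x)) / sd(x)

all.equal(scaled_builtin, scaled_manual)
mean(scaled_builtin)   # essentially 0
sd(scaled_builtin)     # exactly 1
```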

# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
bins
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE)

# look at the table of the new factor crime


# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)

# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
str(boston_scaled$crime)
##  Factor w/ 4 levels "[-0.419,-0.411]",..: 1 1 1 1 1 1 2 2 2 2 ...
## 80% of the sample size
smp_size <- floor(0.8 * nrow(boston_scaled))

## set the seed to make your partition reproducible
set.seed(123)
train_ind <- sample(seq_len(nrow(boston_scaled)), size = smp_size)

train <- boston_scaled[train_ind, ]
test <- boston_scaled[-train_ind, ]
print(nrow(train))
## [1] 404
print(nrow(test))
## [1] 102

Now we have partitioned the data into training (80%) and test (20%) sets. Next, let's fit LDA.

# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)

# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
## [-0.419,-0.411]  (-0.411,-0.39] (-0.39,0.00739]  (0.00739,9.92] 
##       0.2549505       0.2500000       0.2500000       0.2450495 
## 
## Group means:
##                          zn      indus      chas1        nox          rm
## [-0.419,-0.411]  1.01866313 -0.9066422 0.04854369 -0.8835724  0.38666334
## (-0.411,-0.39]  -0.07141687 -0.3429155 0.07920792 -0.5768545 -0.09882533
## (-0.39,0.00739] -0.40148591  0.2162741 0.11881188  0.3637157  0.12480815
## (0.00739,9.92]  -0.48724019  1.0149946 0.06060606  1.0481437 -0.47733231
##                        age        dis        rad        tax    ptratio
## [-0.419,-0.411] -0.9213895  0.9094324 -0.6819317 -0.7458486 -0.4234280
## (-0.411,-0.39]  -0.3254849  0.3694282 -0.5395408 -0.4205644 -0.1079710
## (-0.39,0.00739]  0.4564260 -0.3720478 -0.4349280 -0.3191090 -0.2716959
## (0.00739,9.92]   0.8274496 -0.8601246  1.6596029  1.5294129  0.8057784
##                      black        lstat        medv
## [-0.419,-0.411]  0.3729083 -0.766883775  0.47009410
## (-0.411,-0.39]   0.3103406 -0.164921798  0.01548761
## (-0.39,0.00739]  0.1049654  0.009720124  0.19440747
## (0.00739,9.92]  -0.6383964  0.900379309 -0.71294711
## 
## Coefficients of linear discriminants:
##                  LD1         LD2        LD3
## zn       0.148390645  0.74870532 -1.0874785
## indus    0.040971465 -0.38126652 -0.1619456
## chas1    0.009688321  0.15606070  0.6690362
## nox      0.312458150 -0.67564471 -1.3104018
## rm       0.011086947 -0.16058718 -0.1572603
## age      0.283482486 -0.38817624 -0.1013491
## dis     -0.079848789 -0.38493775  0.2108038
## rad      3.718978412  0.93123532 -0.4706522
## tax     -0.015197127  0.04230505  1.2889075
## ptratio  0.180294774 -0.01592588 -0.3558490
## black   -0.136724112  0.02948075  0.1288959
## lstat    0.145739238 -0.37823065  0.3345688
## medv     0.061327205 -0.53906503 -0.1509890
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9523 0.0364 0.0113
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results
plot(lda.fit, dimen = 2)
lda.arrows(lda.fit, myscale = 1)

The biplot is a bit hard to read, but the arrows show each variable's contribution to the discriminants: rad clearly dominates LD1, which alone captures about 95% of the between-group variance (proportion of trace). Next we will predict with the LDA model on the test set.

true_crime = test$crime
test <- dplyr::select(test, -crime)
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results
table(correct = true_crime, predicted = lda.pred$class)
##                  predicted
## correct           [-0.419,-0.411] (-0.411,-0.39] (-0.39,0.00739] (0.00739,9.92]
##   [-0.419,-0.411]              10             10               4              0
##   (-0.411,-0.39]                2             17               6              0
##   (-0.39,0.00739]               1              9              13              2
##   (0.00739,9.92]                0              0               1             27

The four quantile bins correspond to increasing crime rates, with [-0.419,-0.411] the lowest and (0.00739,9.92] the highest. We predicted the high-crime areas very well: only one observation whose true class was the highest crime rate was misclassified. In the lower crime-rate classes there are more mistakes, which indicates they are harder to separate. Now let's re-create the scaled Boston dataset and run k-means.
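The overall test-set accuracy can be read off the cross-tabulation by summing its diagonal; a small sketch using the counts above:

```r
# Counts from the cross-tabulation above (rows = correct, cols = predicted)
conf <- matrix(c(10, 10,  4,  0,
                  2, 17,  6,  0,
                  1,  9, 13,  2,
                  0,  0,  1, 27),
               nrow = 4, byrow = TRUE)

# Correct predictions sit on the diagonal
test_accuracy <- sum(diag(conf)) / sum(conf)
test_accuracy   # 67 / 102
```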

boston_scaled <- data %>% mutate_at(c("crim", "zn", "indus", "nox", "rm", "age", "dis", "rad", "tax", "ptratio", "black", "lstat", "medv"), ~(scale(.) %>% as.vector))
# k-means needs an all-numeric matrix, so turn the chas factor back into a numeric
boston_scaled$chas = as.numeric(as.character(boston_scaled$chas))
# k-means clustering
km <- kmeans(boston_scaled, centers = 1)

# plot the Boston dataset with clusters
pairs(boston_scaled[5:10], col = km$cluster)

# k-means clustering
km <-kmeans(boston_scaled, centers = 2)

# plot the Boston dataset with clusters
pairs(boston_scaled[5:10], col = km$cluster)

# k-means clustering
km <-kmeans(boston_scaled, centers = 3)

# plot the Boston dataset with clusters
pairs(boston_scaled[5:10], col = km$cluster)

Here we tried three different numbers of clusters. Clearly one cluster is not useful, but two seems to partition the data nicely in most panels. Three tends to split a fairly uniform group into two separate groups, which does not seem useful. So here I would argue that two is a good choice.

Regarding the clustering itself, we can see that there are, for example, two clear clusters in the nox/tax panel, which suggests there is a group of districts with an outlying, more or less identical tax rate.
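A more systematic way to choose the number of clusters (not part of the exercise) is the elbow heuristic: plot the total within-cluster sum of squares (WCSS) against k and look for the bend. A sketch on the built-in iris measurements as a stand-in for the scaled Boston data:

```r
set.seed(123)
data_num <- scale(iris[, 1:4])   # numeric columns only

# Total WCSS for k = 1..6; nstart avoids poor local optima
wcss <- sapply(1:6, function(k) {
  kmeans(data_num, centers = k, nstart = 10)$tot.withinss
})

# WCSS always decreases with k; the "elbow" is where the drop levels off
plot(1:6, wcss, type = "b", xlab = "k", ylab = "total WCSS")
```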

PCA and MCA

date()
## [1] "Fri Dec  3 15:24:45 2021"
library(GGally)
library(ggplot2)
library('FactoMineR')
library('tidyr')
library(dplyr)
human = read.csv('http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/human2.txt', sep=',')
dim(human)
## [1] 155   8
ggpairs(human, mapping = aes(), lower = list(combo = wrap("facethist", bins = 20)))

There is a strong negative correlation between life expectancy and adolescent birth rate, and likewise between maternal mortality and life expectancy: higher life expectancy goes together with lower maternal mortality, as expected. Also, gross national income correlates negatively with maternal mortality and positively with life expectancy. Judging from the distribution of the values, a few countries have very high maternal mortality while the rest are quite low.

str(human)
## 'data.frame':    155 obs. of  8 variables:
##  $ Edu2.FM  : num  1.007 0.997 0.983 0.989 0.969 ...
##  $ Labo.FM  : num  0.891 0.819 0.825 0.884 0.829 ...
##  $ Edu.Exp  : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ Life.Exp : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ GNI      : int  64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
##  $ Mat.Mor  : int  4 6 6 5 6 7 9 28 11 8 ...
##  $ Ado.Birth: num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ Parli.F  : num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
pca_human <- prcomp(human)

# draw a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2)
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

human_std <- scale(human)

# print out summaries of the standardized variables


# perform principal component analysis (with the SVD method)
pca_human <- prcomp(human_std)

# draw a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2)

The results are very different because, before standardization, the scales of the variables differed enormously (GNI in particular), which skewed the PCA. In the standardized plot we see that the first principal component takes high values in countries with high maternal mortality and adolescent birth rates, and low values in countries with high expected education and life expectancy. Niger, for example, has low life expectancy and high maternal mortality. The second principal component, on the other hand, aligns with the female-to-male labour participation ratio and the share of women in parliament, which means that, for example, Iran has a low share of women working while Bolivia has a high one.
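The scaling effect can be demonstrated directly on made-up data: without scaling, a variable with a huge variance dominates the first principal component. A minimal sketch (illustrative numbers only, not the human data):

```r
set.seed(42)
# Two independent made-up variables on wildly different scales
small <- rnorm(100, mean = 0, sd = 1)
huge  <- rnorm(100, mean = 0, sd = 1000)
df <- data.frame(small, huge)

# Unscaled: PC1 is essentially just the high-variance variable
pca_raw <- prcomp(df)
summary(pca_raw)$importance["Proportion of Variance", "PC1"]   # close to 1

# Scaled: both variables contribute roughly equally
pca_std <- prcomp(df, scale. = TRUE)
summary(pca_std)$importance["Proportion of Variance", "PC1"]   # close to 0.5
```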

library('FactoMineR')
library('tidyr')
data(tea)
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")

# select the 'keep_columns' to create a new dataset
tea_time <- dplyr::select(tea, keep_columns)
## Note: Using an external vector in selections is ambiguous.
## ℹ Use `all_of(keep_columns)` instead of `keep_columns` to silence this message.
## ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
# visualize the dataset
gather(tea_time) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped

# multiple correspondence analysis
mca <- MCA(tea_time, graph = FALSE)

# summary of the model
summary(mca)
## 
## Call:
## MCA(X = tea_time, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.279   0.261   0.219   0.189   0.177   0.156   0.144
## % of var.             15.238  14.232  11.964  10.333   9.667   8.519   7.841
## Cumulative % of var.  15.238  29.471  41.435  51.768  61.434  69.953  77.794
##                        Dim.8   Dim.9  Dim.10  Dim.11
## Variance               0.141   0.117   0.087   0.062
## % of var.              7.705   6.392   4.724   3.385
## Cumulative % of var.  85.500  91.891  96.615 100.000
## 
## Individuals (the 10 first)
##                       Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                  | -0.298  0.106  0.086 | -0.328  0.137  0.105 | -0.327
## 2                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 3                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 4                  | -0.530  0.335  0.460 | -0.318  0.129  0.166 |  0.211
## 5                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 6                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 7                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 8                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 9                  |  0.143  0.024  0.012 |  0.871  0.969  0.435 | -0.067
## 10                 |  0.476  0.271  0.140 |  0.687  0.604  0.291 | -0.650
##                       ctr   cos2  
## 1                   0.163  0.104 |
## 2                   0.735  0.314 |
## 3                   0.062  0.069 |
## 4                   0.068  0.073 |
## 5                   0.062  0.069 |
## 6                   0.062  0.069 |
## 7                   0.062  0.069 |
## 8                   0.735  0.314 |
## 9                   0.007  0.003 |
## 10                  0.643  0.261 |
## 
## Categories (the 10 first)
##                        Dim.1     ctr    cos2  v.test     Dim.2     ctr    cos2
## black              |   0.473   3.288   0.073   4.677 |   0.094   0.139   0.003
## Earl Grey          |  -0.264   2.680   0.126  -6.137 |   0.123   0.626   0.027
## green              |   0.486   1.547   0.029   2.952 |  -0.933   6.111   0.107
## alone              |  -0.018   0.012   0.001  -0.418 |  -0.262   2.841   0.127
## lemon              |   0.669   2.938   0.055   4.068 |   0.531   1.979   0.035
## milk               |  -0.337   1.420   0.030  -3.002 |   0.272   0.990   0.020
## other              |   0.288   0.148   0.003   0.876 |   1.820   6.347   0.102
## tea bag            |  -0.608  12.499   0.483 -12.023 |  -0.351   4.459   0.161
## tea bag+unpackaged |   0.350   2.289   0.056   4.088 |   1.024  20.968   0.478
## unpackaged         |   1.958  27.432   0.523  12.499 |  -1.015   7.898   0.141
##                     v.test     Dim.3     ctr    cos2  v.test  
## black                0.929 |  -1.081  21.888   0.382 -10.692 |
## Earl Grey            2.867 |   0.433   9.160   0.338  10.053 |
## green               -5.669 |  -0.108   0.098   0.001  -0.659 |
## alone               -6.164 |  -0.113   0.627   0.024  -2.655 |
## lemon                3.226 |   1.329  14.771   0.218   8.081 |
## milk                 2.422 |   0.013   0.003   0.000   0.116 |
## other                5.534 |  -2.524  14.526   0.197  -7.676 |
## tea bag             -6.941 |  -0.065   0.183   0.006  -1.287 |
## tea bag+unpackaged  11.956 |   0.019   0.009   0.000   0.226 |
## unpackaged          -6.482 |   0.257   0.602   0.009   1.640 |
## 
## Categorical variables (eta2)
##                      Dim.1 Dim.2 Dim.3  
## Tea                | 0.126 0.108 0.410 |
## How                | 0.076 0.190 0.394 |
## how                | 0.708 0.522 0.010 |
## sugar              | 0.065 0.001 0.336 |
## where              | 0.702 0.681 0.055 |
## lunch              | 0.000 0.064 0.111 |
# visualize MCA
plot(mca, invisible=c("ind"))

It seems that the categories 'other' and 'chain store+tea shop' are most correlated with dimension 2, while 'tea shop' and 'unpackaged' are most correlated with dimension 1.


(more chapters to be added similarly as we proceed with the course!)